Project Dataset, Machine Learning Course.
Master Degree in Artificial Intelligence and Computer Science.
a.y. 2021/2022
Group 404 Name Not Found
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
plt.style.use('ggplot')
dt = pd.read_csv('compas-scores.csv', sep=',')
dt
| | id | name | first | last | compas_screening_date | sex | dob | age | age_cat | race | ... | vr_offense_date | vr_charge_desc | v_type_of_assessment | v_decile_score | v_score_text | v_screening_date | type_of_assessment | decile_score.1 | score_text | screening_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | miguel hernandez | miguel | hernandez | 2013-08-14 | Male | 1947-04-18 | 69 | Greater than 45 | Other | ... | NaN | NaN | Risk of Violence | 1 | Low | 2013-08-14 | Risk of Recidivism | 1 | Low | 2013-08-14 |
| 1 | 2 | michael ryan | michael | ryan | 2014-12-31 | Male | 1985-02-06 | 31 | 25 - 45 | Caucasian | ... | NaN | NaN | Risk of Violence | 2 | Low | 2014-12-31 | Risk of Recidivism | 5 | Medium | 2014-12-31 |
| 2 | 3 | kevon dixon | kevon | dixon | 2013-01-27 | Male | 1982-01-22 | 34 | 25 - 45 | African-American | ... | 2013-07-05 | Felony Battery (Dom Strang) | Risk of Violence | 1 | Low | 2013-01-27 | Risk of Recidivism | 3 | Low | 2013-01-27 |
| 3 | 4 | ed philo | ed | philo | 2013-04-14 | Male | 1991-05-14 | 24 | Less than 25 | African-American | ... | NaN | NaN | Risk of Violence | 3 | Low | 2013-04-14 | Risk of Recidivism | 4 | Low | 2013-04-14 |
| 4 | 5 | marcu brown | marcu | brown | 2013-01-13 | Male | 1993-01-21 | 23 | Less than 25 | African-American | ... | NaN | NaN | Risk of Violence | 6 | Medium | 2013-01-13 | Risk of Recidivism | 8 | High | 2013-01-13 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11752 | 11753 | patrick hamilton | patrick | hamilton | 2013-09-22 | Male | 1968-05-02 | 47 | Greater than 45 | Other | ... | NaN | NaN | Risk of Violence | 1 | Low | 2013-09-22 | Risk of Recidivism | 3 | Low | 2013-09-22 |
| 11753 | 11754 | raymond hernandez | raymond | hernandez | 2013-05-17 | Male | 1993-06-24 | 22 | Less than 25 | Caucasian | ... | NaN | NaN | Risk of Violence | 5 | Medium | 2013-05-17 | Risk of Recidivism | 7 | Medium | 2013-05-17 |
| 11754 | 11755 | dieuseul pierre-gilles | dieuseul | pierre-gilles | 2014-10-08 | Male | 1981-01-24 | 35 | 25 - 45 | Other | ... | NaN | NaN | Risk of Violence | 3 | Low | 2014-10-08 | Risk of Recidivism | 4 | Low | 2014-10-08 |
| 11755 | 11756 | scott lomagistro | scott | lomagistro | 2013-12-03 | Male | 1986-12-04 | 29 | 25 - 45 | Caucasian | ... | NaN | NaN | Risk of Violence | 2 | Low | 2013-12-03 | Risk of Recidivism | 3 | Low | 2013-12-03 |
| 11756 | 11757 | chin yan | chin | yan | 2014-01-11 | Male | 1982-02-19 | 34 | 25 - 45 | Asian | ... | NaN | NaN | Risk of Violence | 1 | Low | 2014-01-11 | Risk of Recidivism | 1 | Low | 2014-01-11 |
11757 rows × 47 columns
dt.shape
(11757, 47)
dt.size
552579
print(dt.columns.tolist())
['id', 'name', 'first', 'last', 'compas_screening_date', 'sex', 'dob', 'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score', 'juv_misd_count', 'juv_other_count', 'priors_count', 'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number', 'c_offense_date', 'c_arrest_date', 'c_days_from_compas', 'c_charge_degree', 'c_charge_desc', 'is_recid', 'num_r_cases', 'r_case_number', 'r_charge_degree', 'r_days_from_arrest', 'r_offense_date', 'r_charge_desc', 'r_jail_in', 'r_jail_out', 'is_violent_recid', 'num_vr_cases', 'vr_case_number', 'vr_charge_degree', 'vr_offense_date', 'vr_charge_desc', 'v_type_of_assessment', 'v_decile_score', 'v_score_text', 'v_screening_date', 'type_of_assessment', 'decile_score.1', 'score_text', 'screening_date']
dt.dtypes
id int64 name object first object last object compas_screening_date object sex object dob object age int64 age_cat object race object juv_fel_count int64 decile_score int64 juv_misd_count int64 juv_other_count int64 priors_count int64 days_b_screening_arrest float64 c_jail_in object c_jail_out object c_case_number object c_offense_date object c_arrest_date object c_days_from_compas float64 c_charge_degree object c_charge_desc object is_recid int64 num_r_cases float64 r_case_number object r_charge_degree object r_days_from_arrest float64 r_offense_date object r_charge_desc object r_jail_in object r_jail_out object is_violent_recid int64 num_vr_cases float64 vr_case_number object vr_charge_degree object vr_offense_date object vr_charge_desc object v_type_of_assessment object v_decile_score int64 v_score_text object v_screening_date object type_of_assessment object decile_score.1 int64 score_text object screening_date object dtype: object
dt.head(10)
| | id | name | first | last | compas_screening_date | sex | dob | age | age_cat | race | ... | vr_offense_date | vr_charge_desc | v_type_of_assessment | v_decile_score | v_score_text | v_screening_date | type_of_assessment | decile_score.1 | score_text | screening_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | miguel hernandez | miguel | hernandez | 2013-08-14 | Male | 1947-04-18 | 69 | Greater than 45 | Other | ... | NaN | NaN | Risk of Violence | 1 | Low | 2013-08-14 | Risk of Recidivism | 1 | Low | 2013-08-14 |
| 1 | 2 | michael ryan | michael | ryan | 2014-12-31 | Male | 1985-02-06 | 31 | 25 - 45 | Caucasian | ... | NaN | NaN | Risk of Violence | 2 | Low | 2014-12-31 | Risk of Recidivism | 5 | Medium | 2014-12-31 |
| 2 | 3 | kevon dixon | kevon | dixon | 2013-01-27 | Male | 1982-01-22 | 34 | 25 - 45 | African-American | ... | 2013-07-05 | Felony Battery (Dom Strang) | Risk of Violence | 1 | Low | 2013-01-27 | Risk of Recidivism | 3 | Low | 2013-01-27 |
| 3 | 4 | ed philo | ed | philo | 2013-04-14 | Male | 1991-05-14 | 24 | Less than 25 | African-American | ... | NaN | NaN | Risk of Violence | 3 | Low | 2013-04-14 | Risk of Recidivism | 4 | Low | 2013-04-14 |
| 4 | 5 | marcu brown | marcu | brown | 2013-01-13 | Male | 1993-01-21 | 23 | Less than 25 | African-American | ... | NaN | NaN | Risk of Violence | 6 | Medium | 2013-01-13 | Risk of Recidivism | 8 | High | 2013-01-13 |
| 5 | 6 | bouthy pierrelouis | bouthy | pierrelouis | 2013-03-26 | Male | 1973-01-22 | 43 | 25 - 45 | Other | ... | NaN | NaN | Risk of Violence | 1 | Low | 2013-03-26 | Risk of Recidivism | 1 | Low | 2013-03-26 |
| 6 | 7 | marsha miles | marsha | miles | 2013-11-30 | Male | 1971-08-22 | 44 | 25 - 45 | Other | ... | NaN | NaN | Risk of Violence | 1 | Low | 2013-11-30 | Risk of Recidivism | 1 | Low | 2013-11-30 |
| 7 | 8 | edward riddle | edward | riddle | 2014-02-19 | Male | 1974-07-23 | 41 | 25 - 45 | Caucasian | ... | NaN | NaN | Risk of Violence | 2 | Low | 2014-02-19 | Risk of Recidivism | 6 | Medium | 2014-02-19 |
| 8 | 9 | steven stewart | steven | stewart | 2013-08-30 | Male | 1973-02-25 | 43 | 25 - 45 | Other | ... | NaN | NaN | Risk of Violence | 3 | Low | 2013-08-30 | Risk of Recidivism | 4 | Low | 2013-08-30 |
| 9 | 10 | elizabeth thieme | elizabeth | thieme | 2014-03-16 | Female | 1976-06-03 | 39 | 25 - 45 | Caucasian | ... | NaN | NaN | Risk of Violence | 1 | Low | 2014-03-16 | Risk of Recidivism | 1 | Low | 2014-03-16 |
10 rows × 47 columns
dt.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 11757 entries, 0 to 11756 Data columns (total 47 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 11757 non-null int64 1 name 11757 non-null object 2 first 11757 non-null object 3 last 11757 non-null object 4 compas_screening_date 11757 non-null object 5 sex 11757 non-null object 6 dob 11757 non-null object 7 age 11757 non-null int64 8 age_cat 11757 non-null object 9 race 11757 non-null object 10 juv_fel_count 11757 non-null int64 11 decile_score 11757 non-null int64 12 juv_misd_count 11757 non-null int64 13 juv_other_count 11757 non-null int64 14 priors_count 11757 non-null int64 15 days_b_screening_arrest 10577 non-null float64 16 c_jail_in 10577 non-null object 17 c_jail_out 10577 non-null object 18 c_case_number 11015 non-null object 19 c_offense_date 9157 non-null object 20 c_arrest_date 1858 non-null object 21 c_days_from_compas 11015 non-null float64 22 c_charge_degree 11757 non-null object 23 c_charge_desc 11008 non-null object 24 is_recid 11757 non-null int64 25 num_r_cases 0 non-null float64 26 r_case_number 3703 non-null object 27 r_charge_degree 11757 non-null object 28 r_days_from_arrest 2460 non-null float64 29 r_offense_date 3703 non-null object 30 r_charge_desc 3643 non-null object 31 r_jail_in 2460 non-null object 32 r_jail_out 2460 non-null object 33 is_violent_recid 11757 non-null int64 34 num_vr_cases 0 non-null float64 35 vr_case_number 882 non-null object 36 vr_charge_degree 882 non-null object 37 vr_offense_date 882 non-null object 38 vr_charge_desc 882 non-null object 39 v_type_of_assessment 11757 non-null object 40 v_decile_score 11757 non-null int64 41 v_score_text 11752 non-null object 42 v_screening_date 11757 non-null object 43 type_of_assessment 11757 non-null object 44 decile_score.1 11757 non-null int64 45 score_text 11742 non-null object 46 screening_date 11757 non-null object dtypes: float64(5), int64(11), object(31) memory usage: 4.2+ MB
dt.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 11757.0 | 5879.000000 | 3394.097892 | 1.0 | 2940.0 | 5879.0 | 8818.0 | 11757.0 |
| age | 11757.0 | 35.143319 | 12.022894 | 18.0 | 25.0 | 32.0 | 43.0 | 96.0 |
| juv_fel_count | 11757.0 | 0.061580 | 0.445328 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 |
| decile_score | 11757.0 | 4.371268 | 2.877598 | -1.0 | 2.0 | 4.0 | 7.0 | 10.0 |
| juv_misd_count | 11757.0 | 0.076040 | 0.449757 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 |
| juv_other_count | 11757.0 | 0.093561 | 0.472003 | 0.0 | 0.0 | 0.0 | 0.0 | 17.0 |
| priors_count | 11757.0 | 3.082164 | 4.687410 | 0.0 | 0.0 | 1.0 | 4.0 | 43.0 |
| days_b_screening_arrest | 10577.0 | -0.878037 | 72.889298 | -597.0 | -1.0 | -1.0 | -1.0 | 1057.0 |
| c_days_from_compas | 11015.0 | 63.587653 | 341.899711 | 0.0 | 1.0 | 1.0 | 2.0 | 9485.0 |
| is_recid | 11757.0 | 0.253806 | 0.558324 | -1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| num_r_cases | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| r_days_from_arrest | 2460.0 | 20.410569 | 74.354840 | -1.0 | 0.0 | 0.0 | 1.0 | 993.0 |
| is_violent_recid | 11757.0 | 0.075019 | 0.263433 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| num_vr_cases | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| v_decile_score | 11757.0 | 3.571489 | 2.500479 | -1.0 | 1.0 | 3.0 | 5.0 | 10.0 |
| decile_score.1 | 11757.0 | 4.371268 | 2.877598 | -1.0 | 2.0 | 4.0 | 7.0 | 10.0 |
dt.isnull().sum()
id 0 name 0 first 0 last 0 compas_screening_date 0 sex 0 dob 0 age 0 age_cat 0 race 0 juv_fel_count 0 decile_score 0 juv_misd_count 0 juv_other_count 0 priors_count 0 days_b_screening_arrest 1180 c_jail_in 1180 c_jail_out 1180 c_case_number 742 c_offense_date 2600 c_arrest_date 9899 c_days_from_compas 742 c_charge_degree 0 c_charge_desc 749 is_recid 0 num_r_cases 11757 r_case_number 8054 r_charge_degree 0 r_days_from_arrest 9297 r_offense_date 8054 r_charge_desc 8114 r_jail_in 9297 r_jail_out 9297 is_violent_recid 0 num_vr_cases 11757 vr_case_number 10875 vr_charge_degree 10875 vr_offense_date 10875 vr_charge_desc 10875 v_type_of_assessment 0 v_decile_score 0 v_score_text 5 v_screening_date 0 type_of_assessment 0 decile_score.1 0 score_text 15 screening_date 0 dtype: int64
dt.isna().sum()[dt.isna().sum()>0].sort_values().plot(kind='bar', ylabel='count null values', figsize=(15,5))
<AxesSubplot:ylabel='count null values'>
null_Values=dt.isnull().sum()/len(dt)
print(null_Values*100)
null_Values[null_Values>0].sort_values().plot(kind='bar', ylabel='percentage null values', figsize=(15,5))
id 0.000000 name 0.000000 first 0.000000 last 0.000000 compas_screening_date 0.000000 sex 0.000000 dob 0.000000 age 0.000000 age_cat 0.000000 race 0.000000 juv_fel_count 0.000000 decile_score 0.000000 juv_misd_count 0.000000 juv_other_count 0.000000 priors_count 0.000000 days_b_screening_arrest 10.036574 c_jail_in 10.036574 c_jail_out 10.036574 c_case_number 6.311134 c_offense_date 22.114485 c_arrest_date 84.196649 c_days_from_compas 6.311134 c_charge_degree 0.000000 c_charge_desc 6.370673 is_recid 0.000000 num_r_cases 100.000000 r_case_number 68.503870 r_charge_degree 0.000000 r_days_from_arrest 79.076295 r_offense_date 68.503870 r_charge_desc 69.014204 r_jail_in 79.076295 r_jail_out 79.076295 is_violent_recid 0.000000 num_vr_cases 100.000000 vr_case_number 92.498086 vr_charge_degree 92.498086 vr_offense_date 92.498086 vr_charge_desc 92.498086 v_type_of_assessment 0.000000 v_decile_score 0.000000 v_score_text 0.042528 v_screening_date 0.000000 type_of_assessment 0.000000 decile_score.1 0.000000 score_text 0.127584 screening_date 0.000000 dtype: float64
<AxesSubplot:ylabel='percentage null values'>
The numerical attributes are id, age, juv_fel_count, decile_score, juv_misd_count, juv_other_count, priors_count, days_b_screening_arrest, c_days_from_compas, is_recid, num_r_cases, r_days_from_arrest, is_violent_recid, num_vr_cases, v_decile_score and decile_score.1.
It is not useful to plot histograms for all of them. For example, id is an incremental counter that only identifies each record.
The output of describe() shows that num_r_cases and num_vr_cases contain only null values, and that decile_score.1 is a duplicate of decile_score (their summary statistics are identical). The attributes is_recid and is_violent_recid are our class labels, which we want to treat as boolean attributes.
So we plot only the following numerical attributes:
numericDT = dt[dt.columns.difference(
['id', 'name', 'first', 'last', 'dob', 'age_cat', 'race','compas_screening_date', 'num_r_cases', 'num_vr_cases', 'sex', 'c_jail_in', 'c_jail_out', 'c_case_number', 'c_offense_date', 'c_arrest_date', 'c_charge_degree', 'c_charge_desc', 'r_case_number', 'r_charge_degree', 'r_offense_date', 'r_charge_desc', 'r_jail_in', 'r_jail_out', 'vr_case_number', 'vr_charge_degree', 'vr_offense_date', 'vr_charge_desc', 'v_type_of_assessment', 'v_score_text', 'v_screening_date', 'type_of_assessment', 'score_text', 'screening_date', 'decile_score.1'])]
numericDT.hist(figsize=(15,15))
plt.show()
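As a minimal sketch of an alternative to the long exclusion list above, the numerical columns could be selected by dtype and the uninformative ones dropped afterwards. The toy frame and the helper name `numeric_for_plotting` are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the COMPAS data (values are illustrative).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [69, 31, 34],
    "sex": ["Male", "Male", "Female"],
    "num_r_cases": [float("nan")] * 3,   # entirely null, like in the real data
    "decile_score": [1, 5, 3],
    "decile_score.1": [1, 5, 3],         # duplicate of decile_score
    "is_recid": [0, 1, 1],
})

def numeric_for_plotting(df, drop=("id", "decile_score.1")):
    """Keep numeric columns, discard all-null ones and the names listed in `drop`."""
    numeric = df.select_dtypes(include="number")
    numeric = numeric.dropna(axis="columns", how="all")
    return numeric.drop(columns=[c for c in drop if c in numeric.columns])

selected = numeric_for_plotting(toy)
print(selected.columns.tolist())
```

The columns to discard by name (identifiers and known duplicates) still have to be listed explicitly, since a dtype check cannot detect them.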
numericDF = dt[dt.columns.difference(['id', 'name', 'first', 'last', 'dob', 'age_cat', 'race','compas_screening_date', 'num_r_cases', 'num_vr_cases', 'sex', 'c_jail_in', 'c_jail_out', 'c_case_number', 'c_offense_date', 'c_arrest_date', 'c_charge_degree', 'c_charge_desc', 'r_case_number', 'r_charge_degree', 'r_offense_date', 'r_charge_desc', 'r_jail_in', 'r_jail_out', 'vr_case_number', 'vr_charge_degree', 'vr_offense_date', 'vr_charge_desc', 'v_type_of_assessment', 'v_score_text', 'v_screening_date', 'type_of_assessment', 'score_text', 'screening_date', 'decile_score.1'])]
numericAttributes = numericDF.columns.difference(['is_recid', 'is_violent_recid', 'c_days_from_compas'])
for attribute in numericAttributes:
    # One histogram per numerical attribute, split by the is_recid label
    sn.histplot(x=attribute, hue='is_recid', data=numericDF, kde=True)
    plt.show()
numericDF = dt[dt.columns.difference(['id', 'name', 'first', 'last', 'dob', 'age_cat', 'race','compas_screening_date', 'num_r_cases', 'num_vr_cases', 'sex', 'c_jail_in', 'c_jail_out', 'c_case_number', 'c_offense_date', 'c_arrest_date', 'c_charge_degree', 'c_charge_desc', 'r_case_number', 'r_charge_degree', 'r_offense_date', 'r_charge_desc', 'r_jail_in', 'r_jail_out', 'vr_case_number', 'vr_charge_degree', 'vr_offense_date', 'vr_charge_desc', 'v_type_of_assessment', 'v_score_text', 'v_screening_date', 'type_of_assessment', 'score_text', 'screening_date', 'decile_score.1'])]
numericAttributes = numericDF.columns.difference(['is_recid', 'is_violent_recid', 'c_days_from_compas'])
for attribute in numericAttributes:
    # One histogram per numerical attribute, split by the is_violent_recid label
    sn.histplot(x=attribute, hue='is_violent_recid', data=numericDF, kde=True)
    plt.show()
Non-numerical attributes:
Most of them are free-text descriptions, such as c_charge_desc, r_charge_desc, name, first and last, so they are not useful for our analysis. Other attributes are identification codes (such as id, c_charge_degree, vr_charge_degree), and some others take a single value (such as type_of_assessment and v_type_of_assessment). Most of the remaining non-numerical attributes are dates. The most relevant attributes are sex, age_cat and race, which we rename to ethnicity for ethical reasons.
dt.rename(columns={'race': 'ethnicity'}, inplace=True)
dt['sex'] = dt['sex'].astype('category')
dt['ethnicity'] = dt['ethnicity'].astype('category')
dt['age_cat'] = dt['age_cat'].astype('category')
dt['score_text'] = dt['score_text'].astype('category')
dt['v_score_text'] = dt['v_score_text'].astype('category')
categoricalAttributes = ['sex', 'ethnicity', 'age_cat', 'score_text', 'v_score_text']
for attribute in categoricalAttributes:
    # Bar chart of the frequency of each category
    val = dt[attribute].value_counts()
    val.plot(kind='bar', figsize=(5, 5))
    plt.ylabel('count')
    plt.xlabel(attribute)
    plt.show()
categoricalDT = dt[dt.columns.difference(numericAttributes)]
catAttributes = categoricalDT.columns.difference(['is_recid'])
# categoricalAttributes already holds exactly the five attributes of interest,
# so no extra filtering is needed inside the loop.
for attribute in categoricalAttributes:
    plt.figure(figsize=(10, 5))
    sn.countplot(x=dt[attribute], hue='is_recid', data=categoricalDT)
    plt.show()
categoricalDT = dt[dt.columns.difference(numericAttributes)]
catAttributes = categoricalDT.columns.difference(['is_violent_recid'])
# categoricalAttributes already holds exactly the five attributes of interest,
# so no extra filtering is needed inside the loop.
for attribute in categoricalAttributes:
    plt.figure(figsize=(10, 5))
    sn.countplot(x=dt[attribute], hue='is_violent_recid', data=categoricalDT)
    plt.show()
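The counts plotted above can be hard to compare when the groups differ in size. A possible complement, sketched here on a hypothetical toy frame rather than on the real data, is a row-normalized contingency table built with `pd.crosstab`, which gives the share of each label value within every category.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use dt['sex'] and dt['is_recid'].
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male"],
    "is_recid": [1, 0, 0, 0, 1],
})

# normalize="index" divides each row by its total, so every row sums to 1.
rates = pd.crosstab(toy["sex"], toy["is_recid"], normalize="index")
print(rates)
```

Each row then reads as the distribution of the label inside that group, independent of how many records the group contains.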
numericDT.plot(kind='box', subplots=True, sharex=False, sharey=False, figsize=(15, 27), layout=(5, 4))
plt.show()
sn.pairplot(numericDT, hue = 'is_recid')
plt.show()
sn.pairplot(numericDT, hue = 'is_violent_recid')
plt.show()
plt.figure(figsize=(15, 7))
# Restrict the correlation matrix to numerical columns explicitly
sn.heatmap(dt.select_dtypes(include='number').corr(), annot=True, cmap='magma', fmt='.2f')
plt.show()
First we convert the date attributes to categorical attributes.
Then, using describe() on their value counts, we examine the variability of these attributes: the lower the variability, the more useful it is to plot them.
The attributes describing dates are:
dt['dob'] = dt['dob'].astype('category')
dt['compas_screening_date'] = dt['compas_screening_date'].astype('category')
dt['c_jail_in'] = dt['c_jail_in'].astype('category')
dt['c_jail_out'] = dt['c_jail_out'].astype('category')
dt['c_offense_date'] = dt['c_offense_date'].astype('category')
dt['c_arrest_date'] = dt['c_arrest_date'].astype('category')
dt['r_offense_date'] = dt['r_offense_date'].astype('category')
dt['r_jail_in'] = dt['r_jail_in'].astype('category')
dt['r_jail_out'] = dt['r_jail_out'].astype('category')
dt['vr_offense_date'] = dt['vr_offense_date'].astype('category')
dt['v_screening_date'] = dt['v_screening_date'].astype('category')
dt['screening_date'] = dt['screening_date'].astype('category')
datesAttributes = ['dob', 'compas_screening_date', 'c_jail_in', 'c_jail_out', 'c_offense_date', 'c_arrest_date', 'r_offense_date', 'r_jail_in', 'r_jail_out', 'vr_offense_date', 'v_screening_date', 'screening_date']
for attribute in datesAttributes:
    # Summary statistics of how often each distinct date occurs
    val = dt[attribute].value_counts()
    print(val.describe())
    print("")
| attribute | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| dob | 7800.0 | 1.507308 | 0.822150 | 1.0 | 1.0 | 1.0 | 2.0 | 6.0 |
| compas_screening_date | 704.0 | 16.700284 | 6.775800 | 1.0 | 12.0 | 16.0 | 21.0 | 39.0 |
| c_jail_in | 10577.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| c_jail_out | 10517.0 | 1.005705 | 0.090252 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 |
| c_offense_date | 1036.0 | 8.838803 | 6.645770 | 1.0 | 1.0 | 9.0 | 14.0 | 29.0 |
| c_arrest_date | 802.0 | 2.316708 | 1.641212 | 1.0 | 1.0 | 2.0 | 3.0 | 9.0 |
| r_offense_date | 1090.0 | 3.397248 | 1.957536 | 1.0 | 2.0 | 3.0 | 4.0 | 12.0 |
| r_jail_in | 984.0 | 2.500000 | 1.493288 | 1.0 | 1.0 | 2.0 | 3.0 | 9.0 |
| r_jail_out | 953.0 | 2.581322 | 1.596328 | 1.0 | 1.0 | 2.0 | 3.0 | 10.0 |
| vr_offense_date | 599.0 | 1.472454 | 0.777286 | 1.0 | 1.0 | 1.0 | 2.0 | 6.0 |
| v_screening_date | 704.0 | 16.700284 | 6.775800 | 1.0 | 12.0 | 16.0 | 21.0 | 39.0 |
| screening_date | 704.0 | 16.700284 | 6.775800 | 1.0 | 12.0 | 16.0 | 21.0 | 39.0 |
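The variability check above can also be summarized in a single table. The sketch below, on a hypothetical toy frame with made-up dates, counts the distinct values of each date attribute and relates them to the number of non-null entries; a ratio close to 1 indicates an attribute that behaves almost like an identifier.

```python
import pandas as pd

# Hypothetical toy frame; the real notebook would pass dt[datesAttributes].
toy = pd.DataFrame({
    "dob": ["1947-04-18", "1985-02-06", "1982-01-22", "1991-05-14"],
    "screening_date": ["2013-08-14", "2013-08-14", "2013-01-27", "2013-08-14"],
})

distinct = toy.nunique()             # distinct non-null values per column
summary = pd.DataFrame({
    "distinct": distinct,
    "distinct_ratio": distinct / toy.count(),  # count() ignores nulls
})
print(summary)
```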
The attributes that we want to include in our analysis are the following:
- decile_score and v_decile_score, represented by their categorical counterparts score_text and v_score_text (the numerical columns themselves are dropped in the code below). They are important to show how the score correlates with the class labels is_recid and is_violent_recid.
- sex, age_cat and race, renamed to ethnicity (for the same reasons explained before).
- c_offense_date, r_offense_date and vr_offense_date, to predict how many days pass, on average, before a defendant reoffends.

We keep these attributes as a consequence of the previous Data Understanding step, and we remove all attributes whose variability is too high (typically IDs) or too low.
# Keep only the selected attributes by excluding every other column
dt = dt[dt.columns.difference(['id', 'age', 'decile_score', 'v_decile_score', 'juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count', 'days_b_screening_arrest', 'c_days_from_compas', 'num_r_cases', 'r_days_from_arrest', 'num_vr_cases', 'name', 'first', 'last', 'dob', 'compas_screening_date', 'c_jail_in', 'c_jail_out', 'c_case_number', 'c_arrest_date', 'c_charge_degree', 'c_charge_desc', 'r_case_number', 'r_charge_degree', 'r_charge_desc', 'r_jail_in', 'r_jail_out', 'vr_case_number', 'vr_charge_degree', 'vr_charge_desc', 'v_type_of_assessment', 'v_screening_date', 'type_of_assessment', 'screening_date', 'decile_score.1'])]
dt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11757 entries, 0 to 11756
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   age_cat           11757 non-null  category
 1   c_offense_date    9157 non-null   category
 2   ethnicity         11757 non-null  category
 3   is_recid          11757 non-null  int64
 4   is_violent_recid  11757 non-null  int64
 5   r_offense_date    3703 non-null   category
 6   score_text        11742 non-null  category
 7   sex               11757 non-null  category
 8   v_score_text      11752 non-null  category
 9   vr_offense_date   882 non-null    category
dtypes: category(8), int64(2)
memory usage: 412.9 KB
dt.shape
(11757, 10)
For simplicity and readability, we rename the values of the attribute age_cat as follows:
# Rename the age categories. assign() returns a new DataFrame, which avoids the
# SettingWithCopyWarning raised by an in-place replace() on a column of a slice.
dt = dt.assign(age_cat=dt['age_cat'].cat.rename_categories(
    {'Less than 25': 'young', '25 - 45': 'adult', 'Greater than 45': 'senior'}))
val = dt['age_cat'].value_counts()
val
adult     6649
senior    2668
young     2440
Name: age_cat, dtype: int64
Since this is a binary classification problem, our class labels are the integer attributes is_recid and is_violent_recid. The plots in the previous steps show that is_violent_recid takes only the values 0 and 1, so we keep the labels as integers without binarization.
The attribute is_recid also takes the value -1. We assume that -1 marks unknown information, so we drop those records.
dt=dt[dt.is_recid!=-1]
val = dt['is_recid'].value_counts()
val
0    7335
1    3703
Name: is_recid, dtype: int64
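The filter above can also be written so that it states which label values are accepted instead of which one is rejected. This is a small sketch on hypothetical toy data; it assumes that 0 and 1 are the only meaningful values of is_recid.

```python
import pandas as pd

# Hypothetical toy labels, including one record marked -1 (unknown).
toy = pd.DataFrame({
    "is_recid": [0, 1, -1, 0, 1],
    "is_violent_recid": [0, 0, 0, 0, 1],
})

# Inspect the label values before filtering.
print(toy["is_recid"].value_counts().to_dict())

# Keep only the records whose label is a known binary value.
known = toy[toy["is_recid"].isin([0, 1])]
print(len(known))
```

Compared with `!= -1`, the `isin` form would also exclude any other unexpected code that might appear in the label column.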
dt.isna().sum()
age_cat                 0
c_offense_date       1881
ethnicity               0
is_recid                0
is_violent_recid        0
r_offense_date       7335
score_text             11
sex                     0
v_score_text            4
vr_offense_date     10156
dtype: int64
nullValues=dt.isnull().sum()/len(dt)
nullValues*100
age_cat              0.000000
c_offense_date      17.041131
ethnicity            0.000000
is_recid             0.000000
is_violent_recid     0.000000
r_offense_date      66.452256
score_text           0.099656
sex                  0.000000
v_score_text         0.036238
vr_offense_date     92.009422
dtype: float64
We want to drop the rows with null score_text and v_score_text values.
dt=dt.drop(dt[dt['score_text'].isna()].index)
dt=dt.drop(dt[dt['v_score_text'].isna()].index)
dt.isna().sum()
age_cat                 0
c_offense_date       1880
ethnicity               0
is_recid                0
is_violent_recid        0
r_offense_date       7326
score_text              0
sex                     0
v_score_text            0
vr_offense_date     10145
dtype: int64
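The two drop() calls above can be combined. A minimal sketch on a hypothetical toy frame, using `DataFrame.dropna` with the `subset` parameter so that only nulls in the two score columns cause a row to be removed:

```python
import pandas as pd

# Hypothetical toy frame; nulls in r_offense_date must NOT trigger a drop.
toy = pd.DataFrame({
    "score_text": ["Low", None, "High", "Medium"],
    "v_score_text": ["Low", "Low", None, "High"],
    "r_offense_date": [None, "2013-06-16", None, None],
})

# Remove rows where either score column is null; other columns are ignored.
cleaned = toy.dropna(subset=["score_text", "v_score_text"])
print(cleaned.index.tolist())
```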
For the modeling and prediction part of the main goal we do not keep the date attributes: we remove c_offense_date, r_offense_date and vr_offense_date.
# MAIN DATASET TO REACH THE PRIMARY GOAL (predict if a defendant becomes a recid)
dt_cleaned=dt[dt.columns.difference(['c_offense_date','is_violent_recid','r_offense_date','v_score_text','vr_offense_date'])]
dt_cleaned.info()
dt_cleaned.to_csv('dt_cleaned.csv', index=False)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11027 entries, 0 to 11756
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   age_cat     11027 non-null  category
 1   ethnicity   11027 non-null  category
 2   is_recid    11027 non-null  int64
 3   score_text  11027 non-null  category
 4   sex         11027 non-null  category
dtypes: category(4), int64(1)
memory usage: 216.0 KB
dt_cleaned.shape
(11027, 5)
# SECOND DATASET TO REACH THE SECOND GOAL (predict if a VIOLENT defendant becomes a VIOLENT recid)
dt_cleaned_v=dt[dt.columns.difference(['c_offense_date','r_offense_date','score_text','vr_offense_date'])]
dt_cleaned_v=dt_cleaned_v[dt_cleaned_v.is_recid!=0]
dt_cleaned_v=dt_cleaned_v[dt_cleaned_v.columns.difference(['is_recid'])]
dt_cleaned_v.info()
dt_cleaned_v.to_csv('dt_cleaned_v.csv', index=False)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3701 entries, 2 to 11753
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   age_cat           3701 non-null   category
 1   ethnicity         3701 non-null   category
 2   is_violent_recid  3701 non-null   int64
 3   sex               3701 non-null   category
 4   v_score_text      3701 non-null   category
dtypes: category(4), int64(1)
memory usage: 72.9 KB
dt_cleaned_v.shape
(3701, 5)
Now we need two more datasets to predict the number of days until recidivism. We obtain them by splitting the main dataset into the rows with is_recid=1 and the rows with is_violent_recid=1, which are the rows that have values for r_offense_date and vr_offense_date, respectively.
from datetime import datetime
# DATASET TO predict how many days pass to become a recid
dt_date_r=dt[dt.columns.difference(['is_violent_recid','v_score_text','vr_offense_date'])]
dt_date_r=dt_date_r[dt_date_r.is_recid!=0]
dt_date_r=dt_date_r[dt_date_r.columns.difference(['is_recid'])]
dt_date_r=dt_date_r.drop(dt_date_r[dt_date_r['c_offense_date'].isna() | dt_date_r['r_offense_date'].isna()].index)
dates_diff=[]
for r, c in zip(dt_date_r['r_offense_date'], dt_date_r['c_offense_date']):
    # Days between the original offense and the recidivist offense
    dates_diff.append((datetime.strptime(r, '%Y-%m-%d') - datetime.strptime(c, '%Y-%m-%d')).days)
dt_date_r['dates_diff_in_days']=dates_diff
dt_date_r=dt_date_r[dt_date_r.columns.difference(['c_offense_date','r_offense_date'])]
dt_date_r
| | age_cat | dates_diff_in_days | ethnicity | score_text | sex |
|---|---|---|---|---|---|
| 2 | adult | 160 | African-American | Low | Male |
| 3 | young | 64 | African-American | Low | Male |
| 7 | adult | 41 | Caucasian | Medium | Male |
| 12 | young | 736 | Caucasian | Low | Male |
| 14 | young | 128 | African-American | Medium | Male |
| ... | ... | ... | ... | ... | ... |
| 11736 | senior | 286 | African-American | Medium | Male |
| 11738 | young | 296 | Caucasian | Low | Female |
| 11746 | young | 9 | African-American | Low | Male |
| 11751 | senior | 30 | African-American | Low | Male |
| 11753 | young | 513 | Caucasian | Medium | Male |
3093 rows × 5 columns
dt_date_r.info()
dt_date_r.to_csv('dt_date_r.csv', index=False)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3093 entries, 2 to 11753
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   age_cat             3093 non-null   category
 1   dates_diff_in_days  3093 non-null   int64
 2   ethnicity           3093 non-null   category
 3   score_text          3093 non-null   category
 4   sex                 3093 non-null   category
dtypes: category(4), int64(1)
memory usage: 61.0 KB
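The day difference computed in the loop above can also be obtained with vectorized datetime arithmetic. The following sketch uses a hypothetical toy frame with made-up dates and a hypothetical helper `days_between`; it assumes the date columns hold ISO-formatted strings (categorical columns may need to be cast to str first).

```python
import pandas as pd

# Hypothetical toy dates; the last row lacks c_offense_date and is dropped.
toy = pd.DataFrame({
    "c_offense_date": ["2013-01-26", "2013-04-13", None],
    "r_offense_date": ["2013-07-05", "2013-06-16", "2014-01-01"],
})

def days_between(df, start_col, end_col):
    """Return the number of whole days from start_col to end_col, row by row."""
    start = pd.to_datetime(df[start_col], format="%Y-%m-%d")
    end = pd.to_datetime(df[end_col], format="%Y-%m-%d")
    return (end - start).dt.days

# Keep only the rows where both dates are present, as in the notebook.
complete = toy.dropna(subset=["c_offense_date", "r_offense_date"])
diff = days_between(complete, "c_offense_date", "r_offense_date")
print(diff.tolist())
```

Because the subtraction operates on whole columns, the result stays aligned with the DataFrame index and can be assigned directly as a new column.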
# DATASET TO predict how many days pass to become a violent recid
dt_date_v=dt[dt.columns.difference(['score_text','r_offense_date'])]
dt_date_v=dt_date_v[dt_date_v.is_recid!=0]
dt_date_v=dt_date_v[dt_date_v.is_violent_recid!=0]
dt_date_v=dt_date_v[dt_date_v.columns.difference(['is_recid', 'is_violent_recid'])]
dt_date_v=dt_date_v.drop(dt_date_v[dt_date_v['c_offense_date'].isna() | dt_date_v['vr_offense_date'].isna()].index)
dates_diff_v=[]
for vr, c in zip(dt_date_v['vr_offense_date'], dt_date_v['c_offense_date']):
    # Days between the original offense and the violent recidivist offense
    dates_diff_v.append((datetime.strptime(vr, '%Y-%m-%d') - datetime.strptime(c, '%Y-%m-%d')).days)
dt_date_v['dates_diff_in_days']=dates_diff_v
dt_date_v=dt_date_v[dt_date_v.columns.difference(['c_offense_date','vr_offense_date'])]
dt_date_v
| | age_cat | dates_diff_in_days | ethnicity | sex | v_score_text |
|---|---|---|---|---|---|
| 2 | adult | 160 | African-American | Male | Low |
| 12 | young | 736 | Caucasian | Male | Medium |
| 22 | adult | 242 | Caucasian | Male | Low |
| 36 | adult | 659 | African-American | Male | Low |
| 39 | adult | 296 | African-American | Male | Medium |
| ... | ... | ... | ... | ... | ... |
| 11675 | young | 217 | African-American | Male | Medium |
| 11678 | adult | 252 | African-American | Male | Low |
| 11680 | adult | 926 | African-American | Male | Medium |
| 11683 | senior | 741 | African-American | Male | High |
| 11696 | adult | 337 | Caucasian | Male | Low |
728 rows × 5 columns
dt_date_v.info()
dt_date_v.to_csv('dt_date_v.csv', index=False)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 728 entries, 2 to 11696
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   age_cat             728 non-null    category
 1   dates_diff_in_days  728 non-null    int64
 2   ethnicity           728 non-null    category
 3   sex                 728 non-null    category
 4   v_score_text        728 non-null    category
dtypes: category(4), int64(1)
memory usage: 14.8 KB